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Abstract. A priori dimension reduction is a widely adopted technique for reducing 
the computational complexity of stationary inverse problems. In this setting, the 
solution of an inverse problem is parameterized by a low-dimensional basis that is 
often obtained from the truncated Karhunen-Loeve expansion of the prior distribution. 
For high-dimensional inverse problems equipped with smoothing priors, this technique 
can lead to drastic reductions in parameter dimension and significant computational 
savings. 

In this paper, we extend the concept of a priori dimension reduction to non¬ 
stationary inverse problems, in which the goal is to sequentially infer the state of 
a dynamical system. Our approach proceeds in an offline-online fashion. We first 
identify a low-dimensional subspace in the state space before solving the inverse 
problem (the offline phase), using either the method of “snapshots” or regularized 
covariance estimation. Then this subspace is used to reduce the computational 
complexity of various filtering algorithms—including the Kalman filter, extended 
Kalman filter, and ensemble Kalman filter—within a novel subspace-constrained 
Bayesian prediction-and-update procedure (the online phase). We demonstrate the 
performance of our new dimension reduction approach on various numerical examples. 
In some test cases, our approach reduces the dimensionality of the original problem 
by orders of magnitude and yields up to two orders of magnitude in computational 
savings. 


Keywords: State estimation, Bayesian filtering, Kalman filter, ensemble Kalman filter, 
dimension reduction. Submitted to: Inverse Problems 


1. Introduction 

An inverse problem converts noisy, incomplete, and possibly indirect observations to 
characterizations of some unknown parameters or states of a physical system. These 
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unknowns are often functions defined on a spatial domain and linked to the observables 
via a forward model. High dimensionality due to the numerical discretization of the 
unknown functions is often viewed as one of the grand challenges in designing scalable 
inference methods. This challenge motivates the development of dimension reduction 
approaches, which exploit the possibly low-dimensional intrinsic structure of the inverse 
problem, to alleviate the effect of the “curse of dimensionality.” 

A typical inverse problem is ill-posed; the unknowns are not uniquely identihed 
by the observations. This is a joint effect of noisy, incomplete observations and the 
smoothing properties of the forward model. In the Bayesian inference framework [46, 27], 
ill-posedness is addressed by employing a suitable prior distribution and characterizing 
the posterior distribution of the unknowns conditioned on the observations. In this 
setting, the priors often encode structural information about the unknowns, such as 
spatial smoothness properties. This a priori structural information opens up the 
possibility of prior-based dimension reduction, especially in cases where the variation of 
the high-dimensional unknowns can be explained by a small number of basis functions. 
For instance, the truncated Karhunen-Loeve expansion [28, 34] of the prior distribution 
is employed in [35] for identifying such a a priori low-dimensional basis in static inverse 
problems. Computational cost can be greatly reduced by projecting the original high¬ 
dimensional unknowns onto the subspace spanned by the resulting low-dimensional 
basis. 

In this paper, we extend this concept of a priori dimension reduction to non¬ 
stationary inverse problems, in which the goal is to sequentially infer the state of a 
dynamical system. Such problems can be solved efficiently using hltering methods, 
where the posterior prediction from the previous time step is used as the prior for the 
current state, and the new posterior is obtained by conditioning the current state on 
data observed at the current time. The computational difficulty of applying hltering 
methods to high-dimensional problems stems both from propagating the distribution 
of the high-dimensional state forward in time and from solving the high-dimensional 
inference problem when the new data is observed. 

To reduce the computational complexity of hltering methods, our proposed 
dimension reduction method is applied in an offline-online fashion. In the offline phase, 
we identify a low-dimensional subspace of the state space before solving the inverse 
problem, using either the method of snapshots or regularized covariance estimation. 
In the online phase, the computational complexity of various (Gaussian) hltering 
algorithms—including the Kalman hlter, extended Kalman hlter, and ensemble Kalman 
hlter—is reduced by constraining the update and the prediction steps within the 
resulting subspace, in a unihed subspace-constrained Bayesian framework. 

The success of the proposed approach naturally requires that the unknown states 
can be captured by a low-dimensional basis. This is the case, for instance, if either 
the model states are sufficiently smooth or the states can be captured by a low¬ 
dimensional attractor. We show numerical examples where the reduction provides 
signihcant computational savings in diherent ways—by reducing the dimension of the 
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linear systems involved in the prediction and update steps, but also, in ensemble filtering 
approaches, by reducing the number of ensemble members required. We discuss the 
limitations of the proposed approach in cases where the state cannot be represented 
efficiently in a fixed low-dimensional subspace. We also discuss several different ways to 
construct the subspace based on existing “snapshots” of the model states. 

The idea of reducing the dimension of hltering algorithms has been investigated 
before. The closest existing algorithm to our approach is the reduced-order Kalman 
filter (ROKF) [9, 25], where dimension reduction is obtained by projecting the model 
dynamics onto a fixed low-dimensional subspace. While the ROKF shares some 
similarities with the present approach, fundamental differences remain: our algorithms 
do not explicitly project the model dynamics onto the subspace, but rather constrain 
the inference (update) step using the subspace. We show that the latter strategy yields 
more appropriate prior distributions for each inference problem. Moreover, we extend 
the discussion of dimension reduction to ensemble filtering methods. Differences between 
the present approach and the ROKF are analyzed in detail later in the paper. 

Another related approach is the reduced-rank Kalman filter (RRKF) of [18]. In 
this approach, one propagates prediction uncertainties only along directions in the 
state space where the variance of the states grows most quickly (the so-called Hessian 
singular vectors). The difference with our approach is that the subspace of RRKF 
is re-computed at each hltering step through the solution of an eigenvalue problem, 
whereas in our approach the basis is hxed and computed offline. Thus, RRKF is a local- 
in-time approach, based on local linearization, whereas our approach tries to hnd a 
low-dimensional subspace for the state based on an analysis of global-in-time dynamics. 
Our approach is less computationally intensive, but its applicability may be restricted 
to cases where a global low-dimensional representation for model states exists. Our 
numerical examples, however, demonstrate that this simple strategy can yield signihcant 
computational savings in a range of hltering algorithms. 

In [10], dimension reduction is sought not in the Gaussian hltering context, but 
for a sequential Monte Carlo (particle hltering) method. Again, dimension reduction is 
performed locally in time. Filtering is restricted to coordinates spanning the most 
unstable modes around the current nominal trajectory; interestingly, for spatially 
distributed systems, these unstable modes are often the low-wavenumber components 
of the state. 

Dimension reduction approaches for hltering problems can be benehcial in many 
ways. For instance, they reduce memory requirements, which are prohibitively large 
for high-dimensional problems when standard Kalman hlters are applied. Indeed, 
memory constraints have motivated the development of various approximate hltering 
methods [3, 4]. The dimension reduction approach presented here reduces memory 
requirements, but also ohers speedups that may be benehcial for (even smaller scale) 
real-time estimation and control problems, e.g., chemical process tomography [42]. As 
noted above, speedups also extend to ensemble hltering methods: constraining the 
inference step onto a subspace implicitly regularizes the problem, and thus reduces 
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the number of ensemble members required to achieve a given accuracy. 

The paper is organized as follows. In Section 2, we review prior-based dimension 
reduction for static inverse problems and develop the linear-Gaussian filtering equations 
for subspace coordinates. In Section 3, we discuss how the method applies to extended 
Kalman hltering and ensemble methods. Section 4 discusses techniques for constructing 
the low-dimensional subspace. Section 5 analyzes the differences between our approach 
and the ROKF. In Sections 6.1-6.2, we study the behavior of the dimension reduction 
approach via linear and nonlinear examples. Section 7 offers some concluding remarks. 

2. Prior-based dimension reduction 

2.1. Static problems 

Our starting point is the prior-based dimension reduction technique for static inverse 
problems, which we briefly review here. The unknown function x{s), s G O, is defined 
in some spatial domain O. Discretizing x{s) on a grid defined by a set of nodes 
and some basis functions yields a d-dimensional vector x = [x(si),..., x(sd)]''" G M'^. 
The discretized unknown vector x is related to observations y G M™' via the model 

y = /(x) + £, (1) 

where e N(0,R) and / is a (possibly nonlinear) mapping from the unknown x to the 
observable output. Moreover, let us assume that we have a Gaussian prior x r\j N(^,S). 
Then, the posterior density for x is 

p(x|y) oc exp (||y - /(x)||^ + ||x - ^|||)^ , (2) 

where ||b||^ denotes the quadratic form b^A“^b. 

The idea in prior-based dimension reduction is to constrain the problem onto a 
subspace that contains most of the variability allowed by the prior; see, for instance 
[35]. This can be done by computing the singular value decomposition (SVD) of the 
prior covariance matrix” S = UAU^, where U G contains the singular vectors 

ui,U 2 ,...,Urf (as colums) and A = diag(Ai,..., A^) has the singular vectors in the 
diagonal. Dimension is reduced by representing the unknown as a linear combination 
of the r leading (scaled) singular vectors: 

X = ^-hPrQ: with Pr = UrAy^ = [\/AiUi, \/A 2 U 2 , . . . , \/ArUr]. (3) 

Inserting this parameterization into the problem leads to the following posterior density 
for subspace coordinates ct: 

p(Q:|y) oc exp (||y - + Pr.Q:)||K + ||P^Q:|||)^ . (4) 

It is easy to verify that the prior term simplifies to ||PrQ:|||; = Q:^(P^S“^Pr)Q: = ||Q:||y, 
where I,- is the r x r identity matrix. The posterior can be thus written simply as 

p(a|y) oc exp (||y - + P,.a)|||^ + ||Q:||y)^ , 


( 5 ) 
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and the dimension of the inverse problem has been reduced from d to r. This can be 
helpful, for instance, when MCMC samplers, which are challenging to apply in high¬ 
dimensional problems, are used to quantify uncertainty in the parameters. 

If the model is linear, /(x) = Fx, the posterior is Gaussian N(Q:pos, ^'pos) with 
mean and covariance matrix given by 

^'pos = {{FPrVK-\FPr) + lr)~" (6) 

CKpos = ^'pos(FPr)^R"^(y - F^). (7) 

Thus, one needs to apply the model to the r columns of and solve r-dimensional 
linear system, which is computationally much easier than solving the full problem if 
r -C d. 

2.2. Dynamical problems 

Here, we discuss how dimension reduction can be implemented for dynamical state 
estimation problems. Let us begin with the following linear Gaussian state space model: 

Xfc = MfcXfc_i -h Efc (8) 

Yfc = HfcXfc -h efc. (9) 

In the above system, G is the forward model that evolves the state in time 
and Hfc G is the observation model that maps the state to the observations. The 

model and observation errors are assumed to be zero mean Gaussians: N(0,Q,) 

and efc N(0,Rfc) with known covariance matrices Qa, G and R^ G 

The linear Gaussian problem can be solved with the Kalman hlter, which proceeds 
sequentially as follows. Assume that, at time step A:, the marginal posterior is the 
following Gaussian: 

XA,_i|yi,fc_i ~ N(x“_i, C^_i). (10) 

The prediction step involves propagating this Gaussian forward with the model M^, 
which yields the Gaussian 

Xfc|yi:fc -1 ~ N(x{,c{), (11) 

where x{ = Ma;X^_;^ and C{ = MA;C^_]^M^-|-Qfc. Throughout the paper, we follow the 
notation commonly used in weather forecasting and data assimilation literature: we use 
the superscript a to refer to the “analysis” (posterior) estimate updated with the most 
recent observations, and the superscript / to refer to the “forecast” (prior) estimate. 

In the update step, the predicted Gaussian is updated with the new observations 
that become available. The resulting posterior density is 

p(xfc|yi:fc) oc exp (^||x{ - Xfc||^/ + ||yfc - , (12) 

which is, again, Gaussian with known mean and covariance matrix given by the standard 
Kalman hlter formulas, which we choose not to rewrite here (see any standard textbook 
on the subject, e.g., [45]). 
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The most straightforward application of the prior-based dimension reduction 
technique in dynamical problems would be to dehne separately for each hlter step via 
the leading singular values and vectors of the prior covariance matrix C{. This approach 
has a few potential problems. First, computing the local leading singular vectors at 
each time step can be a computationally challenging task. Second, by truncating the 
prior covariance, we might discard directions that seem less important (i.e., that have 
low prior variance) locally in time, but that become relevant at a later time step. In 
our experiments, this approach led to inconsistent behavior of the hlter; good hltering 
accuracy was obtained for some cases, but in other cases the hlter performed poorly or 
even diverged. 

Here, we examine an alternative, simpler strategy, where a global subspace is 
constructed a priori (before the hltering is started) and is then hxed for the hltering. 
This approach is motivated by the fact that the state of a dynamical system often lives 
in a subspace of much smaller dimension than the full state space; the state vector often 
has some properties (e.g., smoothness) that enable it to be ehectively described in a 
low-dimensional subspace. If we can capture the subspace where the essential dynamics 
of the system happen, we can potentially reduce the whole hltering procedure onto this 
subspace. This idea is discussed here. 

Now, we parameterize the unknown as x^. = -|- Pr-oifc, where G is a 

hxed reduction operator that does not change in time and x{ is the predicted (prior) 
mean. For now, we assume that such a representation exists; discussion about how to 
construct P^ is reserved for Section 4. To derive the hltering equations for the subspace 
coordinates cx.^, assume that the marginal posterior distribution for otk_i at time k — 1 
is 

Q:fc_i|yi,fc_i ~ N(a^_i, (13) 

Transforming this Gaussian distribution to the original coordinates using Xfc_i = 
x{_^ -|- PrQ;fc_i yields the following Gaussian distribution in the full state space: 

~ N(xy, + (14) 

By propagating this Gaussian distribution forward with M^., we obtain the mean and 
the covariance matrix of the predictive distribution Xfc|yi:fe_i ~ N(x{, C{) in the original 
coordinates: 

x{ = Mfc(x{_i + P^a“_J (15) 

cl = (M,P,)T'^,(MfcP,)^ + Qfc. (16) 

Applying this as the prior, and inserting the parameterization x^ = x{ -t-P^oifc into (12) 
yields the following marginal posterior density for ck^: 

p{cxk\yi:k) OC exp (^HP^oifcll^/ -F Hy^ - H^xf - HfcP^afcUR^. (17) 

This is equivalent to a linear problem with Gaussian likelihood y^ — H^x^ ~ 
N(HkPrCXk, Rfc) and zero mean Gaussian prior cxk N(0.(Pyc{)-'P.)-'). The 
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n= ((HtP,gRj‘(HiP,) + P;(C{)-‘P,)'‘, (18) 

< = 4’J(HtP,)TR-‘ri, (19) 

and ffc = Yfc — is the prediction residual. Note that now does not whiten the 

prior, in contrast with the prior-based dimension reduction discussed in Section 2.1 for 
static problems, and thus the term Pj(C{)“^Pr is not equal to the identity. Moreover, 
the matrix C{ cannot be formed explicitly in high-dimensional problems. To efficiently 
evaluate (C{)^^Pr, we recall that here 

~ -|- Qfc = (MfcPrAfc)(MfcPrAfc)'^ -|- Q^, (20) 

where A^ G is the matrix square root ~ ^k-^k- To shorten the notation, 

let us denote B^. = M^PrA^ G Now, applying the Sherman-Morrison-Woodbury 

matrix inversion formula yields 

(C{)-‘ = (BiB[ + Q,)-' = Qp-QpB*(BrQi‘B, + P)-‘BrQp,(21) 

where 1^ is the r x r identity matrix. Now, the matrix -|- I,, that needs to be 

inverted is only r x r. Thus, a product (C{)“^b can be efficiently computed as long as 
is easy to compute. This condition must hold in order for this technique to work. 
In practice, is usually a simple (e.g., diagonal) matrix postulated by the user. 

To sum up, a single step of the reduced Kalman filter with a fixed basis is given as 
an algorithm below. Assume that we have the previous estimate and its covariance 
matrix available. Then the algorithm reads as follows: 

Algorithm 1: one step of the reduced Kalman filter. 

Input: cy.l_^ and Output: cx^ and 

(i) Compute the prior mean x{ = Mfc(x{_^ -|- Pj.Q:fc_;^). 

(ii) Perform the decomposition = A^Aj. 

(hi) Compute the matrix B^ = MfcPrAfc. 

(iv) Compute (C{)“^Pr via the matrix inversion formula (21). 

(v) Compute and via formulas (18)-(19). 


Remark 1. Computationally, this version is much lighter than the standard Kalman 
hlter if r -C d. In the standard Kalman filter, the prediction covariance matrix is 
computed as C{ = -|- Q^; that is, we need to compute products of d x d 

matrices (or to apply the forward model to the d columns of C^_]^). Moreover, when 
updating the prior covariance, one needs to operate with and solve a system of m 
linear equations. In the approach described here, one needs to work with dxr matrices 
Bfc, solve a system of r linear equations, and do one inversion of a r x r matrix. Also, 
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here the basis vectors are hxed, so we avoid solving local eigenvalue problems, which 
are needed, for instance, in the reduced rank Kalman hlter of [18]. 

Remark 2. Here, the parameterization is centered at the predicted mean, = 
+ PrCKfc. An alternative would be to use a hxed mean, x^ = ^ + PrCKfc, where 
is some hxed ohset. The former looks for a correction to the predicted mean in 
the subspace, whereas the latter attempts to describe the state vector itself in a hxed 
subspace. In our experiments, the former yields much better hlter accuracy, especially 
with small r. 


3. Extensions to nonlinear problems 

Here, we discuss two ways to extend the dimension reduction idea to problems where the 
evolution and/or observation models are nonlinear. We still assume additive Gaussian 
model and observation errors, so our state space model now reads as 

Xfc = Al(xfc_i) + Efc (22) 

Yk = 'H(xfc) + Ofc, (23) 

where JOl and "H are the nonlinear forward and observation models. We start by 
discussing the extended Kalman hlter, which requires linearizations of the forward and 
observation models. Then, we discuss ensemble hltering techniques where linearizations 
are not needed. 


3.1. Extended Kalman filtering 


The extended Kalman hlter (EKF) replaces the model and observation matrices in the 
KF with their linearized versions. Thus, the algorithm is the same as Algorithm 1 
in Section 2, but the mean is propagated with the nonlinear forward model and the 
prediction residual is calculated with the nonlinear observation model. Elsewhere 
and Hfc are replaced with 


Mk 


aWl(Xfc_i) 

dxk-i 


Xfc-l=x“_j 


Hfc 


5^(Xfc) 


(9xfc 


Xfc=x{ 


(24) 


Note that for large scale problems, computing the above matrices explicitly is not 
feasible. Instead, one often derives the linearized model analytically (as an operator), 
which makes it possible to propagate vectors forward with the linear model, which is 
equivalent to computing products M^b where b G Here, these tangent linear 

codes need to be applied in steps (iii) and (v) of Algorithm 1. In step (hi), we compute 
Bfc = MfcP^Afc, which can be done by applying the linearized forward model to the 
r columns of PrA*,. In step (v), we need HfcP^, which can be computed by applying 
the linearized observation model to the r columns of P^. In both cases, we only need 
to propagate r vectors through the linearized models, instead of d vectors as in the 
standard EKF. 
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Ensemble filters have become popular for solving very high-dimensional dynamical 
state estimation problems arising in geophysical applications such as numerical weather 
prediction. The development started started from the ensemble Kalman hlter (EnKF, 
[13, 26]) in the 1990s, and different variants are under active development. The idea of 
the EnKF is to represent the state and its uncertainty with samples (an “ensemble” of 
states), and, roughly speaking, to replace the covariances in the hltering formulas with 
their empirical estimates calculated from the samples. 

For high-dimensional problems that involve complex physical models, the ensemble 
size is necessarily much smaller than the dimension of the problem. As a result, 
the obtained covariance estimates are rank-dehcient and can suffer from “spurious 
correlations” (unphysical correlations appearing randomly due to small sample size); 
see, e.g., [1, 16, 24] for discussion. To overcome these issues, various localization 
techniques have been proposed, where the empirical covariance estimates are regularised 
by, for instance, explicitly removing unrealistic distant correlations from the covariance 
matrices; see [2, 21, 36]. Recently, adaptive localization techniques have also been 
developed, where the localization mechanism is tuned on-line in the hlter [7, 8]. 
Localization is one of the key techniques to make EnKFs work for small ensemble sizes. 

In addition to rank dehciency, the classical EnKF suffers from sampling error, 
as the observation and model errors are accounted for by randomly perturbing the 
observations and predicted ensemble members during the estimation. To avoid this 
additional variance in the resulting estimates, so-called square root EnKFs have been 
developed, which are deterministic schemes where no such random perturbations are 
used for the observations [1, 6, 15]. Accounting for model error still remains a difficulty, 
although some techniques have been recently proposed; see, e.g., [38] and the references 
therein. 

In the method discussed here, the state is constrained onto a subspace, which 
heavily regularises the estimation problem; for instance, if the state is smooth, the rough 
features are explicitly removed from the estimation problem, and all of the information 
in the samples can be used for inferring the smooth features. As a result, the need for 
localization is diminished, as demonstrated in the numerical examples in Section 6.1. 
Moreover, model error can be can be easily included in the approach, provided that we 
are able to apply the model error covariance matrix to a vector efficiently. 

Here, we present one way of extending the dimension reduction idea to ensemble 
hltering. Let us now consider the state space model where the forward model is nonlinear 
but the observation model is linear (the nonlinear observation model case is discussed 
later): 

Xfc = >l(xfc-i)Efc (25) 

Yk = HfcXfc -h efc. (26) 

In ensemble hltering, we represent the distribution of ock with samples. Let us assume 
that at time step k—1 we have posterior samples ^ available. 
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sampled from the Gaussian posterior N(q:^_^, These obviously correspond to 

samples + PrOc^i^ in the full state space. In the prediction step, we move 

the posterior samples forward with the dynamical model; x{’* = Then, we 

compute the empirical covariance matrix of the predicted samples and add the model 
error covariance to obtain the prediction error covariance: 

d = Gov (M(xfc_i) + Efc) ^ XfcXj + Qfc, (27) 


where 


Xfc 









Thus, XfcX^ is the empirical covariance estimate computed from the prediction 
ensemble. The mean x{ is taken to be the posterior mean from the previous step 
propagated via the evolution model, x{ = = A4(xl_^ + PrQ;^_^), instead of 

the empirical mean computed from the prediction ensemble, which is why we divide 
by \/iVens instead of V^ens — 1, see the remarks below for more discussion about this 
choice. 

If the observation model is linear, the reduced dimension ensemble hltering 
algorithm stays almost the same as the reduced KF algorithm (Algorithm 1 in Section 2). 
The only difference is the prior covariance matrix C{, which is now dehned via the 
prediction ensemble. When Ng^s -C d, operating with (C{)“^, needed in step (v) of 
Algorithm 1, can still be done efficiently via the Sherman-Morrison-Woodbury inversion 
formula: 


(C'r‘ = (XtX^+Q,)-‘ = Qt'-Qt'Xt(XjQ^'Xt+I„_.)-‘xrQt'.(29) 

Note that now, when applying (C{)“^ to a vector, we are left with the inversion of an 
Ngns X Nens matrix instead of an r x r matrix. 

To summarise, one step of the ensemble Kalman hlter with dimension reduction is 
given below. 


Algorithm 2: one step of the ensemble Kalman filter with reduced 
dimension. Input: Q:^_^ and Output: ct^ and 


(i) Draw Aens samples from N(q:^_;^, and transform samples 

into the full state space: = xl_^ + P^cfi^d 

(ii) Gompute the prior mean x{ = At(x{_^ + PrOL‘^_i) and propagate the 
samples: x{’* = M.(xl’\). 


(hi) Form Xi, = 




4) ••• C 




XI 


/VK, 


(iv) Gompute (C{) via the matrix inversion formula (29). 

(v) Gompute and via formulas (18)-(19). 
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Remark 3. The computational cost of the ensemble algorithms is dictated by both the 
number of basis vectors r and the number of ensemble members A^ens- The computational 
cost is similar to that of the standard EnKF. What makes dimension reduction attractive 
from ensemble filtering point of view is that the number of samples needed to capture 
the distribution of the r-dimensional variable can be much smaller than the number 
of samples needed to get accurate filtering results for the d-dimensional variable in 
the full space. Thus, similar performance can be obtained with fewer ensemble members, 
as can be observed in the numerical examples in Sections 6.1-6.2. 

Remark 4. The ensemble hlter presented above differs from the classical ensemble 
Kalman filter (EnKF) developed in [13, 26]. The classical EnKF is a non-Gaussian 
filter; it applies a linear update with perturbed observations to non-Gaussian prediction 
samples to get the posterior ensemble. The version presented here is a Gaussian filter 
in the sense that the prior is assumed to be a Gaussian whose covariance matrix is 
estimated from the prediction ensemble. 

Remark 5. Another difference between the proposed method and many other ensemble 
hlters is that the prior mean here is x{ = AT(x^_j^) instead of the empirical ensemble 
mean used, for instance, in the classical EnKF. Technically, one could easily choose the 
ensemble mean as x{ as well. However, we have noticed that the choice x{ = 
for propagating the mean, analogous to EKE and variational (3D-Var and 4D-Var) 
methods, works better for many problems. One reason might be that the x{ obtained 
this way lies close to the attractor of the forward model, whereas the sample mean 
obtained from a small number of ensemble members might be further away from it and 
thus represent an “unphysical” state. A similar approach was taken in some recently 
developed filtering algorithms; see, for instance, [44] for some discussion. 

Remark 6. If the observation model is nonlinear, the posterior distribution for ctk is no 
longer Gaussian. Then, one way forward is to apply the linearized observation operator 
as in the EKE to obtain a Gaussian approximation of the posterior, and to sample new 
members from the Gaussian. Another way is to sample new posterior samples for ctk 
directly from the non-Gaussian posterior 



p(afc|yi:fc) OC exp 


using, for instance, Markov chain Monte Garlo (MGMG) techniques, which should be 
feasible if r is not too large and l-i is relatively simple (note that evaluating the posterior 
density does not require the forward model Ai). For instance, novel optimization-based 
sampling techniques like [5] (which requires Gaussian priors as above) can potentially 
be used to generate posterior samples efficiently, as discussed in [44] in connection with 
high-dimensional filtering problems. 


Remark 7. The subspace representation also opens up a way to implement other 
sample-based filtering techniques for high-dimensional problems, such as the popular 
unscented Kalman filter (see, e.g., [45]), which, in the subspace version, would require 
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2r + 1 samples to propagate the covariance forward instead oi 2d+l as in the fnll state 
space version. Even particle filtering in the snbspace might be possible with a reasonable 
nnmber of particles. These ideas are left for fntnre stndy and not pnrsned fnrther here. 


4. Reduced subspace construction 

The reduced snbspace basis for representing the unknown state is constructed in 
a similar way as for static problems discussed in Section 2.1. Consider the covariance 
matrix S that represents the covariance structure of states, and in particular its 
eigendecomposition S = UAU^. We compute the basis from the r leading 
eigenvectors Ur = (ni,...,Ur) and square roots of the corresponding eigenvalues 
Ar = diag(Ai,...,Ar): 

Pr = UrAy2. (31) 

If the eigenvalues of the covariance matrix S decay quickly, the variation of states can 
be captured by a low-dimensional subspace spanned by the leading eigenvectors. We 
note that scaling the basis by the eigenvalues in (31) is not necessary, but can unify the 
scales of the different state variables. Of course, this method requires access to S. In 
the rest of this section, we present several ways to estimate this covariance matrix. 


4 . 1 . Principal component analysis 


Principal component analysis (PGA) can be applied to “snapshots”—which are 
possible model states obtained from existing model simulations—for constructing low¬ 
dimensional subspaces of high-dimensional dynamical systems. Depending on the field 
of application, this procedure is also named empirical orthogonal functions [37] in 
meteorology, or proper orthogonal decomposition [43, 14] in model reduction. The 
reduced basis obtained from PGA can then be used in either model reduction [47, 40] 
or hltering (e.g., the ROKF method [9] or our approach as presented in Section 2.2). 

In non-st at ionary inverse problems, our main interest is dynamical systems without 
steady states (e.g., chaotic models). In this setting, we use trajectories obtained from 
either a sufficiently long free model simulation or multiple model simulations with 
randomized initial conditions. Given a sufficient number of snapshots the 

subspace basis is computed using (31) via the eigendecomposition of empirical state 
covariance 


S 


1 

N-1 





X 


1 

N 



(32) 


2=1 2=1 

where x is the empirical state mean. For high-dimensional dynamical systems, it 
is not feasible to form the empirical covariance directly and apply dense matrix 
eigendecomposition methods. In this case, either Krylov subspace methods [22, 29] or 
randomized methods [23, 30] should be applied together with matrix-free operations— 
the matrix vector product with S—to compute the eigendecomposition. 
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The number of snapshots naturally can affect the quality of the basis P^, and a 
sufficiently large N should be chosen to capture the essential behavior of the model. 
Above, the basis is constructed from un-regularised empirical covariance estimates, 
without assuming any particular form for the covariance. If the number of snapshots 
available is limited, we can instead employ regularised covariance estimation techniques, 
where certain assumptions about the covariance structure are introduced to regularise 
the estimation problem. These techniques are discussed below. 

4-2. Regularized covariance estimation 

Another viable route for state covariance estimation is to infer the state correlation 
structure from snapshots and a priori information such as smoothness assumptions. 
Here, we discuss how Gaussian processes (GPs) can be used for the task. We consider 
two types of GPs: a stationary GP modeled by a kernel function [39], or a (possibly) 
non-stationary GP modeled by a a differential operator [41, 31]. 

4.2.1. Stationary GPs via covariance kernels. We hrst discuss the kernel approach, 
where each element of the covariance matrix is given by a kernel function k of the form 


k(^Si,Sj^0fi 


(33) 


where Sj, Sj G G are spatial locations used to discretize the states, and all the information 
about smoothness, correlation length, and variability can be encoded in the parameter 
6 . For example, one commonly used kernel function is the squared exponential kernel 



(34) 


where 6*i and 62 control the variability and correlation length, respectively, and d{si, Sj) 
is a distance between points Si and sj. With this kernel function, the eigenvalues decay 
quickly (exponentially), and it is easy to capture the kernel with a hnite basis. 

Given the mean of the GP, empirically estimated from the snapshots, we estimate 
the parameters 0 in a Bayesian framework. The likelihood function of 6 , given a 
snapshot collection takes the form 



Gombining the likelihood with a prior distribution p{0), we obtain the maximum a 
posteriori estimate 


0 = argmax^ p ... ,x*'^^|0) x p{0) 


(36) 


of the kernel parameters. The reduced subspace basis can then be computed from the 
eigendecomposition of the state covariance S(0). 
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Remark 8. Here the smoothness assumption is used to “hll the gap” between the 
unknown high-dimensional correlations of the state and the information provided by a 
limited number of snapshots. For a GP defined by stationary kernels, this assumption 
can be enforced by using a smooth kernel such as the squared exponential kernel. 

4-2.2. Nonstationary GPs via Gaussian Markov random fields. The stationary 
assumption we used in the above-mentioned kernel method may not be suitable 
for dynamical systems where the states have heterogeneous spatial correlations. 
Furthermore, operations with the dense covariance matrix S(0) in (35) can be 
computationally challenging for high-dimensional states, because the covariance matrix 
can be singular and computational costs of dense matrix operations—especially 
factorization and inversion—scale poorly with dimension. 

This motivates us to model the GP using the inverse of the covariance, i.e., the 
precision matrix so that Gaussian Markov random field (GMRF) models [41] 

can be used to construct the precision as a sparse matrix. We particularly mention the 
work of [31], in which the sparse precision matrix is constructed from the finite element 
discretization of the following stochastic partial differential equation (SPDE): 

( 7 (s)-V-(/s:(s)V))^a:(s) = >V(s), (37) 

where W(s) is a white noise process in space. For spatially constant 7 ( 5 ) = 7 and 
K,{s) = K, it can be shown that the solution u{s) of the SPDE (37) defines a Gaussian 
process with the Matern family of correlation functions; see [31] and references therein. 
The functions 7 ( 5 ) and k{s) together control the correlation length and variability of 
the GP, and the scalar a controls the smoothness, and therefore a nonstationary GP 
can be defined by prescribing spatially heterogeneous 7 ( 5 ) and k{s). 

For a positive integer a, discretizing the SPDE (37) yields a sparse precision matrix 
fl. Here, we discretize (37) using the hnite element method with linear basis functions, 

j 

j=i 

where Xj = x{sj), j = 1 ,..., J, is a set of nodal points and (j)j{s), j = 1 ,..., J, is a set 
of linear basis functions associated with the nodal points. Here the nodes are defined to 
coincide with the locations of the states in the filtering problem. The precision matrix 
can be constructed given the mass matrix and the stiffness matrix of the hnite element 
discretization, which are dehned as 

Aij = j 'y{s)(j)i{s)(j)j{s)ds, and = j [V ■ (K(s)V)0j(s)] 0j(s)ds, 

where z, j = 1,..., J. The functions 7 ( 5 ) and k{s) can also be discretized by linear basis 
functions, which are given as 

J ,7 

7(s) = X]7i0j(s), and k{s) = 

i=i i=i 
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respectively. This yields the local mass and stiffness matrices 
= I <t’t(s)4‘M4‘j{s)ds, 

Kt = I iv-{Ms)^)4>M]hWds, 

snch that the overall mass and stiffness matrices can be written as 

./ 

A( 7 ) = 5 ^ 7 fcA^ and K(/^) = «:, K^ (38) 

k=l k=l 

where 7 = [ 7 ( 31 ), 7 ( 57 )]"'" and k = [k(si), ..., Here, the local mass and 

stiffness matrices, and K^, can be precompnted for a given set of finite element 
basis fnnctions. Following the recursive definition given in [31], the precision matrix 
n{ a,^, K.) parameterized by a scalar a and vectors 7 and k, is 

n(a, 7 ,K) = K(k) + A(7), a = l, 

f^(a, 7 , k) = - 1,7, k)A( 7 )"^ (K(k;) + A( 7 )) , a > 1. (39) 


In the present work, we will pre-select the order of the differential operator by 
choosing an a value, a = a, where a G {2,3,4}. A larger exponent, e.g., a > 4, is not 
recommended as it will lead to a high computational cost in the following parameter 
estimation problem. For a fixed d, the parameters 7 and k, can be estimated in a 
Bayesian manner. The likelihood function for this estimation problem takes the form 


P (- 


41 ) 


AN) 


h,i^) = 


N 


V(27r)'^|ri(d,7,K)|exp 7,- fi) j .(40) 

Choosing spatially varying parameters 7 and k, provides the flexibility needed to model 
a non-stationary covariance structure. However, estimating 7 and k, from a limited 
number of snapshots is itself an ill-posed inverse problem, so priors must be assigned to 
these parameters to remove the ill-posedness. 

To limit the degrees of freedom in the estimation, we prescribe the function 7(5) 
to be a scalar, i.e., 7(5) = 7, and use only a spatially varying n{s) to control the 
nonstationarity of the resulting GP. We use the mass lumping technique to approximate 
the mass matrix rather than dealing with the computationally prohibitive inversion 
A( 7 )“^. For the case 7(5) = 7, the lumped mass matrix is given as 

( j J 

% E E 

k=l 1=1 


We use an exponential prior to enforce the positivity of 7. Furthermore, a smooth 
lognormal process is prescribed as the prior for k{s) to enforce the positive semi¬ 
definiteness of the stiffness matrix K(k). By defining u = log(At), the posterior 
distribution of the Bayesian inference problem can be written as 


p(7, 



oc 
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a/ {27iY |r 2 (d, 7 , 1^)1 exp 



7 , i/)(xW - (41) 


\ ^=1 / 

X p(7) X exp , 

where (i^o, fij,) define the mean and precision of the lognormal prior for k. The precision 
matrix f 2 (d, 7 , u) in (41) is given as 


ri(d, 7 ,i^) = K(exp(i/)) + Al( 7 ), d = 1, 

f2(Q;, 7 , u) = f2(d - 1, 7 , i^) Al( 7 )~^ (K(exp(i^)) + Al( 7 )) , d > 1. (42) 


We can then obtain the maximum a posteriori estimate of the GMRF parameters 
( 7 , i/) by maximizing the logarithm of the posterior density, which yields 

{7>} = argmax^^^logp(7, 

fl 1 ^ . 

= argmax^^^<^ - log(|rj(d, 7 , jz)|) - 2 

I ^ i=i 

- i/q) + logp(d) + logp( 7 ) 

The optimization problem (43) is continuons, and the gradient and Hessian of the 
objective with respect to 7 and v can be analytically derived (not reported here for 
brevity); hence, gradient-based nonlinear programing tools can be nsed to obtain an 
optimnm. In particnlar, we employ the snbspace trnst region method of [11, 12] with 
inexact Newton steps. Note that in the inexact Newton solve, we nse Hessian-vector 
prodncts rather than explicitly forming the fnll Hessian, to ensnre that compntational 
costs and memory reqnirements scale favorably with parameter dimension. 

Given the solntion of (43), the low-dimensional snbspace for onr rednced hlters can 
be compnted from the eigendecomposition of the estimated state covariance f 2 (d, 7 , i>)~^ 
nsing the matrix-free methods discnssed in the PGA case (Section 4.1). 

Remark 9. A snbtle issue in constructing the precision matrix is the choice of boundary 
condition for the SPDE (37). For a GP specihed with scalar 7 and k, prescribing 
a Dirichlet boundary condition will enforce zero variability on the boundary, and 
prescribing a no flux boundary condition—which is a common choice in the literature— 
will roughly double the variability at the boundary compare to the variability in the 
interior. Glearly both choices will result a GP that has nonstationary behavior near 
the boundary, even though the GP with scalar 7 and k dehned on an inhnite domain 
should be stationary in theory. In our numerical examples, boundary conditions do not 
present any difficulties, since we use periodic boundary conditions that are inherited 
from the structure of the corresponding data assimilation problem. In a more general 
setting, we recommend to use a zero-flux boundary condition and to let the data 
determine the nonstationary correlation structure through the solution of (43). This 
way, artifacts created by the boundary condition can be potentially compensated for 
via the inhomogeneous k{s) held. 
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The reduced-order Kalman filter (ROKF), developed in [9] and discussed further in [25] 
is, in principle, very similar to the dimension reduction approaches presented in this 
paper; like our approach, ROKF uses a hxed low-dimensional subspace to represent the 
state vector. Therefore, it is useful to discuss the differences between the techniques in 
more detail. 

To see the difference between the methods, it is instructive to compare what kind of 
priors (predictive distributions) the methods induce for the subspace coordinates. The 
prior mean is propagated in the same way, but the covariance is handled differently. 
In the ROKF, the prediction covariance in the reduced space, denoted here by is 
computed by evolving as follows: 

= {PjMkPr)n-l{'Pj^kPrV + PjQkPr (44) 

= PjClPr, (45) 

where C{ = {MkPr)^l_i{MkPry + Q,k is the posterior covariance of the previous time 
step propagated forward with M^. This has two interpretations: (a) projecting the 
forward dynamics onto the subspace spanned by the columns of and using it to 
propagate the reduced covariance forward, and (b) propagating the posterior covariance 
of the previous time step with the full model and projecting the resulting prediction 
covariance onto the subspace. 

On the other hand, the prediction covariance in our approach (see Section 2.2) is 

*1= (p;(Ci!)-‘P.)'‘. (46) 

That is, while ROKF projects the predicted covariance matrix C{, we project the 
predicted precision matrix (C{)“^. This difference has a nice geometric interpretation; 
projecting the covariance matrix is equivalent to marginalizing the prior onto the 
subspace, whereas projecting the precision matrix amounts to taking the conditional 
of the prior in the subspace. 

To visually see the difference, consider the following simple example. Assume that 
the full dimension of the state space is d = 2 and that the subspace (of dimension r = 1) 
is the x-axis: = [1,0]"'". The model is taken to be a 2 x 2 matrix with random 

entries sampled from N(0,1). We start with zero mean and covariance '^%_i in the 
subspace (x-axis), and propagate it forward with both the ROKF and our approach. 
The results are shown in Figure 1. One can clearly see that the prior induced by ROKF 
can be significantly different than the prior induced by our approach. Specihcally, the 
prior induced by the ROKF is always wider than the conditional prior of our approach. 
It is clear that marginalizing the prior can yield posterior estimates which are outside 
the essential support of the prior. 
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Figure 1. Initial covariance in the subspace (red) propagated forward with the 
dynamical model (blue). Propagated covariance in the subspace using the ROKF 
(green) and our approach (black). The lines and ellipse contain 95% of the probability 
mass of the associated 1-dimensional and 2-dimensional Gaussians. 


6. Numerical examples 

Here, we demonstrate the proposed dimension reduction algorithms with two synthetic 
hltering problems: a 240-dimensional version of the Lorenz model and a 1600- 
dimensional example using the quasi-geostrophic model. 

As the reference methods, we use the standard extended Kalman filter (EKF) and 
the standard ensemble Kalman hlter (EnKF); see [16]. The purpose of the experiments 
is to highlight some of the properties of the proposed approach, such as its behaviour 
with small sample sizes in ensemble hltering, rather than to draw conclusions about the 
performance of the approach relative to all the recent developments in the ensemble 
hltering literature. For this reason, and to keep the comparisons simple, we choose 
the well-known standard EnKF as the reference method instead of one of the many 
variants developed recently. A thorough performance comparison with all the recent 
developments in hltering methods is a challenging task (e.g., handling all the tuning 
issues of the various hlters) and left for future research. 

For the EnKF, we implement a simple and widely used localization scheme, obtained 
by tapering (setting the covariance between distant points to zero) the prediction 
covariance matrix using the 5th order piecewise rational function [20]. The correlation 
cut-oh length was chosen experimentally so that roughly optimal hlter performance was 
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6.1. Example 1: Lorenz model II 


6.1.1. Model description. As a small scale nonlinear example, we consider a generalized 
version of the Lorenz 96 model, the model II described in [33]. The evolntion model is 
given by an ODE system of N equations, each dehned as 

dX ^ ^ 

,, = i—Xn-2K-iXn-K-j+Xn-K+j-iXn+K+j)/K‘^—Xn+Fn, (47) 


j=-J i=-J 

where n = 1,... ,N and iL is a chosen odd integer and J = {K — l)/2. The variables 
are periodic: A_j = and Xat+j = Xj for i > 0. With K = 1, the system reduces 

to the standard Lorenz 96 model introduced in [32]. 

In our experiments we use a range of values for K, and choose the forcing Fn so 
that the model attains chaotic behaviour (verihed experimentally). In the prediction 
model used in the estimation, we use values K = 5,9,17, 33, 65 and the corresponding 
forcing values Fn = 10,10,12,14, 30 for all n. Increasing K introduces stronger spatial 
dependence between neighbouring variables, and yields spatially smoother solutions; 
example solutions with N = 240 and varying K are given in Figure 2. Controlling the 
smoothness allows us to demonstrate how the dimension reduction works in different 
cases: the smoother the unknown, the fewer basis vectors we need to describe it 
accurately. This is demonstrated in Figure 3, where we plot the fraction of the energy 
of the empirical covariance matrix of the model trajectories as a function of the number 
of basis vectors used in the representation. More precisely, we plot 
a function of r, where A* is the Ah largest eigenvalue of the empirical covariance matrix 
computed from model simulation output. One can see that with large K, most of the 
variability of the model state can be captured in a low-dimensional subspace, whereas 
with small K more basis vectors are needed. 


6.1.2. Experiment setup. In the experiments, we use values K = 5,9,17,33,65 and 
the corresponding forcing values Fn = 10,10,12,14, 30 for all n. We generate data for 
the estimation by simulating the model and adding 1 % normally distributed random 
perturbations to the forcing values to introduce error into the prediction model. The 
observation frequency is 0.05 time units (twice the time step used to solve the ODE) 
and every 10 th state variable is observed (Xi, Xu,..., X 231 ), which yields altogether 
24 measurements per observation time step. Data is generated for 20 time units, which 
yields altogether 400 observation times. The ODE was solved using the 4th order 
Runge-Kutta method. 

The model error covariance matrix used in the experiments is simply = fil for 
all fc, where fi is chosen experimentally from the interval fi G [0.01,0.3] so that roughly 
optimal tracking performance is obtained (RMS error between the estimates and the 
truth is minimised) for each hlter. 
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Figure 2. Solutions to the Lorenz model II with N = 240 at one time step with 
different values for K. Larger K yields smoother solutions. 



Figure 3. Cumulative energy of the empirical covariance matrix of the solution 
trajectories with varying K. 


The candidate subspaces were constructed by the PCA, GP, and GMRF techniques 
discussed in Section 4. For PGA, we used the trajectory obtained by simulating the 
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model for 1200 steps (that is, the covariance was estimated using 1200 samples). GP 
and GMRF parameters were htted using 32 snapshots of the model state. For the GP 
approach, the squared exponential covariance function was used, see equation (34), and 
the obtained estimates for the variance and correlation length parameters were 6 i = 6.42 
and 02 = 9.33. For the GMRF approach, we use an exponent of d = 2. 

6.1.3. Results: EKF. First, we compare the reduced EKF described in Sections 2 and 
3.1 to the full EKF for the K = 33 case, where the model state is spatially rather smooth 
and thus well described in a low-dimensional subspace; see Figures 2 and 3. The hrst 16 
basis vectors obtained via PGA, GP, and GMRF are given in Figure 4; the hrst vectors 
represent large scale smooth features and the later ones describe hner scale features. 


P1-P4 





P5-P8 P9-P12 








P13 P16 





Figure 4. The first 16 basis vectors obtained from PCA analysis and GP and GMRF 
fits. 


All methods were started from an all-zero initial state, xq = 0 and identity 
covariance Cq = I. In Figure 5, we compare the RMS error of the EKF and reduced EKF 
with varying numbers of basis vectors r, three ways for constructing the subspace (PGA, 
GP, and GMRF), and two ways to parameterize the unknown (solid lines: centered at 
the prior mean x^. = -|- PrCtfc, dashed lines: hxed offset x^ = ^ -|- P^oifc). As the hxed 

offset /X, we use the empirical mean of the simulated model trajectory for PGA and a 
constant /j, for GP and GMRF. 

We observe that centering the parameterization at the prior mean improves the 
results dramatically compared to hxed mean. With small r, the PGA basis works 
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slightly better than GP and GMRF. With the hxed mean parameterization, PGA and 
GMRF work roughly equally well, and GP a little worse. But all in all, the three 
different ways of constructing the subspace all yield similar results. 

In this example, we are able to obtain a reasonably accurate hlter even with 
r = 4 (!), whereas the hxed offset parameterization requires roughly r = 20 for similar 
accuracy. Thus, we are able to reduce the dimension and the computational complexity 
almost two orders of magnitude compared to the full state dimension N = 240. 



Figure 5. The average RMS error computed from steps 100-400 for the full EKF (red 
line) and for reduced EKF with increasing r (black lines). 


A summary of the results for all cases K = 5,9, 17, 33, 65 using the PGA basis is 
given in Figure 6. We plot the relative difference between the RMS error obtained with 
the full EKF and with the reduced EKF for all cases with varying r, using only the 
parameterization x^. = + 'PrOLk- We observe that with high K (when the trajectories 

are smooth), a small r is enough to capture the system; for instance, with K = 65 and 
K = 32, only roughly r = 8 vectors are needed to get an accurate hlter. With small K, 
however, the system contains more hne scale features and larger r is needed to obtain 
good hltering performance; for example, K = 5 requires roughly r = 85 for similar 
accuracy. This example illustrates how the efficiency of the proposed approach depends 
on the smoothness properties of the system. 

6 . 1 . 4 . Results: EnKF. Here, we compare the reduced EnKF described in Section 3.2 
to the standard EnKF. The results for the K = 33 case with varying r are given in 
Figure 7 for different ensemble sizes. We observe that the reduced EnKF works much 
better than the standard EnKFs with small ensemble sizes; with the reduced EnKF, 
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Figure 6. Relative difference of the mean RMS error obtained by reduced EKF 
compared to full EKF with various values for K and r. 


we are able to obtain a convergent filter with an ensemble size as small as Ne^s = 5, 
whereas the standard EnKF has problems converging with A'^ens < 20. For example, the 
rednced FnKF with r = 12 and Mens = 5 yields similar performance as full FnKF with 
i Vpn s = 100. Again, with sufficiently high Agns (here Apns > 100) the FnKF performance 
starts to catch up. 

Figure 7 also illustrates the interesting connection between r (number of basis 
vectors used) and N^ns- With smaller r, the performance that can be obtained is poorer, 
but, on the other hand, a smaller Mens is needed to achieve that performance. With 
larger r, better performance can be obtained, but only if Apn s is set high enough. That 
is, for each Mens, there seems to be an r that provides an optimal compromise between 
representation error (due to small r) and sampling error (due to small Nens)- 

We conclude that in ensemble hltering, the dimension reduction approach, when 
feasible, can offer a way to develop a reasonably accurate hlter with fewer ensemble 
members. Restricting the inference to a subspace can reduce the need for sample 
covariance matrix regularization via localization techniques, which otherwise are needed 
when high-dimensional hltering problems are solved with small ensemble sizes [2, 36]. 

6.2. Example 2: the two-layer quasi-geostrophic model 

Next, we test the subspace hltering algorithms using the two-layer quasi-geostrophic 
model (QG model) [17], which is often used as a benchmark system for data assimilation 
studies for numerical weather prediction (NWP). The model provides a reasonably 
good analogue of large-scale mid-latitude chaotic dynamics, while being relatively cheap 
computationally [19]. Next, we briehy describe the model equations and our estimation 
setup. For more details about the model as we use it, refer to [19]. 
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Figure 7. Mean RMS errors for full EnKF (with and without localization) and reduced 
EnKF with varying ensemble size and number of basis vectors r used. For iVens < 20, 
the EnKF results are cropped off because of filter divergence. 


6.2.1. Model description. The two-layer quasi-geostrophic model simulates atmo¬ 
spheric flow for the geostrophic (slow) wind motions. The geometrical domain of the 
model is specified by a cylindrical surface vertically divided into two “atmospheric” 
layers. The model also accounts for an orographic component that defines the surface 
irregularities affecting the bottom layer of the model. The latitudinal boundary con¬ 
ditions are periodic, whereas the values on the top and the bottom of the cylindrical 
domain are user-supplied constant values. The geometrical layout of the two-layer QG 
model mapped onto a plane is illustrated in Figure 8. In the figure, parameters Ui and 
U 2 denote mean zonal flows in the top and the bottom atmospheric layers, respectively. 
The model formulation we use is dimensionless, where the non-dimensionalization is 
defined by the length scale L, velocity scale U, and the layer depths Di and D 2 . 



Figure 8. Geometrical layout of the two-layer quasi-geostrophic model. 
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The model operates with variables called potential vorticity and stream function, 
where the latter is analogous to pressure. The model is formulated as a coupled system 
of PDEs (48) describing a conservation law for potential vorticity. The conservation law 
is given as 


PiQi _ „ ^ 2^2 
Dt ~ ' Dt 


(48) 


where Di denotes the substantial derivatives for latitudinal wind Ui and longitudinal 
wind Vi, dehned as ^ + Vi^] qi denote the potential vorticity functions; 

index i specihes the top atmospheric layer [i = 1) and the bottom layer {i = 2). 
Interaction between the layers, as well as relation between the potential vorticity g* and 
the stream function ipi, is modeled by the following system of PDEs: 


gi = VVi - -^1 (V’l - ^> 2 ) + fiy, (49) 

q2 = F2{i>2-i’l) + fiy + Rs- (so) 


Here Rg and fi denote dimensionless orography component and the northward gradient 
of the Coriolis parameter, which we hereafter denote as /q. The relations between the 
physical attributes and dimensionless parameters that appear in (49)~(50) are as follows: 


Fi = 


Rs = 


PoL^ 


gDi 
S{x,y) 
VD 2 


Fo = 




AO 


, 9 = 9 -^ 
9D2 9 


1 fi — fiojjy 


where A6 dehnes the potential temperature change across the layer interface, 6 is the 
mean potential temperature, g is acceleration of gravity, rj = jR is the Rossby number 
associated with the dehned system, and S{x,y) and fio are dimensional representations 
of Rs{x,y) and fi, respectively. 

The system of (48)~(50) dehnes the two-layer quasi-geostrophic model. The state 
of the model, and thus the target of estimation, is the stream function fii. For the 
numerical solution of the system, we consider potential vorticity functions qi and q 2 to 
be known, and invert the spatial equations (49) and (50) for fii. More precisely, we 
apply to equation (49) and subtract Fi times (50) and F 2 times (49) from the result, 
which yields the following equation: 

[vVi] - {F, + F2) = 

v^gi - F 2 {qi - fiy) - Fi {q 2 - fiy - Rs) • (51) 


Equation (51) can be treated as a non-homogeneous Helmholtz equation with negative 
parameter — (Fi + F 2 ) and unknown Once is solved, the stream function 

for the top atmospheric layer is determined by a Poisson equation. The stream function 
for the bottom layer can be found by plugging the obtained value for into (49), (50) 
and solving the equations for A- The potential vorticity functions qi are evolved over 
the time by a numerical advection procedure which models the conservation equations 
(48). 
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6.2.2. Experiment setup. We run the QG model with 20 x 40 grid in each layer, and the 
dimension of the state vector is thus 1600. To generate data, we run the model with 1 
hour time step using layer depths Di = 6000 and D2 = 4000. Data is generated at every 
6th step (hlter step is thus 6 hours) by adding random noise for 100 randomly chosen 
grid points with standard deviation a = 2.5 ■ 10“^. For the estimation, bias is introduced 
to the forward model by using wrong layer depths, Di = 5500 and D2 = 4500. The 
model error covariance matrix was Q = fil, where fi = 10“^ was chosen experimentally 
so that roughly optimal EKF performance was obtained (here, the tracking performance 
was quite insensitive to fi). A snapshot of the model simulation at one time step and 
the measurement locations are illustrated in Figure 9. 


Bottom layer 



Top layer 



10 20 30 40 


Figure 9. A snapshot of the true state. The measurement locations are given as black 
dots. Contour lines in the background represent stream function (target of estimation) 
and the filled contours represent the potential vorticity. 


6.2.3. Results: EKF. In Figure 10 we compare the mean RMS errors (computed 
over 400 hlter steps) of the full EKF and the reduced EKF with two different 
parameterizations and different numbers of basis vectors r used. Only the PGA basis 
is considered here. We observe, as in the other example, that the parameterization 
centered at the predicted mean works much better, especially with a small r. We are 
able to obtain reasonably accurate hltering results using only r = 20 basis vectors, which 
reduces the GPU time (and memory requirements) by almost two orders of magnitude 
compared to full EKF. Interestingly, with a sufficiently high r, the average RMS error 
is actually lower than with the full EKF. This can be explained by the additional prior 
information brought into the problem by restricting the inference onto a subspace. When 
r approaches the full dimension of the problem d, all methods agree. 

6.2.4. Results: EnKF. Next, we run the reduced EnKF using the parameterization 

centered at the predicted mean, + PrCKfc. In Figure 11, we plot the mean RMS 

error computed over 400 hlter steps for varying ensemble sizes and varying number of 
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Figure 10. Mean RMS errors of the full EKF and the reduced EKF with two 
parameterizations as a function of the number of basis vectors used. 


basis vectors r. The results qualitatively follow the same pattern as with the Lorenz 
model in Section 6.1; with large r, a larger ensemble size is needed to get an accurate 
hlter, and with small r, a smaller ensemble size is sufficient to get close to the optimal 
performance that is achievable with that r. For instance, with r = 50, ensemble size 
Nens = 10 yields better tracking performance than r = 200 with iVens = 200. 

The results are much better than what could be obtained with the standard EnKF; 
here, A^ens > 200 would be required even to get the standard EnKF to converge; see 
the results of [44]. Localization methods dramatically improve EnKF performance; for 
comparison, in Figure 11 we show the results for a simple localization, where we taper the 
prediction covariance matrix again using the 5th order piecewise rational function [20], 
experimentally tuning the cutoff length in the localization to achieve roughly optimal 
performance. However, the subspace algorithms still yield better results, as in the Lorenz 
example. Restricting the hltering onto a subspace regularizes the problem enough so 
that the need for localization is diminished. 

Figure 11 contains some results with ensemble size iVens = 0. Here, zero ensemble 
size means that no samples were used to propagate the uncertainty (only the posterior 
mean was propagated), and the prediction covariance was taken to be the model error 
directly: C{ = XfcX^ + Qfc = Q^. For small r, this simple 3D-Var type of strategy with a 
hxed prior was enough to get a convergent hlter. When the uncertainty propagation via 
samples was added and the sample size was increased, the hlter accuracy was improved, 
as expected. The larger the value of r, the more crucial the uncertainty propagation. 
This behavior can be explained by the fact that restricting the inference onto a subspace 
already heavily regularizes the problem, and thus propagating the covariance accurately 
is less important. For instance, using a small r restricts the inference to a subspace 
spanned by spatially smooth basis vectors. In the full space algorithms, such smoothness 
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information would be obtained by propagating the covariance forward in time. In the 
subspace method with small r, non-smooth directions are explicitly removed from the 
estimation problem, and even a simple, hxed prior can yield reasonably accurate results. 
Note also that here the number of observations is 100, which is, in many cases, larger 
than the dimension of the subspace, making the estimation problems numerically well 
posed and the role of the prior less important. 



Figure 11. Mean RMS errors as a function of ensemble size with varying number of 
basis vectors r used. Ensemble size 0 means that a fixed prior covariance was used, 
without error propagation: C{ = = Q^. 


7. Discussion and conclnsions 

In this paper, we presented an effective and simple-to-implement dimension reduction 
strategy for solving non-stationary inverse problems in a Bayesian framework. By 
identifying a global reduced subspace that captures the essential features of the state 
vectors, we provided a new subspace-constrained Bayesian estimation technique for 
reducing the computational cost of hltering algorithms. 

Our approach is hrst applied to the Kalman hlter for linear Gaussian models, and 
then generalized to nonlinear problems via the extended and ensemble Kalman hlters. 
In the Kalman hlter and extended Kalman hlter cases, the computational savings of 
our subspace-constrained technique is due to two sources: (a) the number of forward 
model simulations required in each prediction step is only equal to the reduced subspace 
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dimensions, as the error is only propagated along the coordinates of subspace basis; and 
{b) the update step can be formulated efficiently on the subspace coordinates. This 
way we also avoid handling matrices in the full dimension of the state space. In the 
ensemble version, computational savings stem from the fact that when the inference is 
constrained into a low-dimensional subspace, fewer ensemble members are needed for 
covariance estimation compared to the full space approach. Also, the need for covariance 
localization techniques to regularize the predicted covariance is diminished. 

Two approaches for constructing the reduced subspace are discussed. The first 
idea—widely used in model reduction community—is to obtain snapshots of typical 
model states (e.g., by performing a sufficiently long free model simulation) and to 
compute the leading eigenvectors of the resulting empirical state covariance matrix. The 
second idea is to infer the state covariance matrix from a limited number of snapshots 
using a Gaussian process hypothesis. This choice “hlls in” the missing information 
about model states (due to a limited number of snapshots) using the correlation 
structure encoded in the particular choice of GP. We discussed GP constructions using 
either stationary kernels that directly specify the covariance matrix, or non-stationary 
differential operators that correspond to sparse precision matrices. The GP construction 
also opens the door to other possible state covariance reconstruction approaches; for 
instance, one could infer the state covariance from previous data sets. We will investigate 
this extension in future work. 

We demonstrated the performance of our approach using two numerical examples. 
The first one is a 240-dimensional Lorenz system, where the smoothness of the model 
states can be controlled with a tuning parameter. This rather low-dimensional example 
is used to demonstrate the performance of our dimension reduction approach in various 
regimes. For smooth settings, the dimension can be reduced dramatically (to less than 
10) while still obtaining hltering accuracy—for both extended Kalman hltering and 
ensemble hltering—comparable to the full space algorithms. On the other hand, for non¬ 
smooth settings with “rough” features, the level of dimension reduction that maintains 
hlter accuracy becomes less dramatic. The second example is a 1600-dimensional two- 
layer quasi-geostrophic model, where the state dimension can be a reduced to about 30 
without losing hltering accuracy for both the extended and ensemble hlters, compared to 
their full space counterparts. A two order of magnitude reduction in computing time is 
achieved for extended Kalman hltering in this case. In ensemble hltering, our subspace 
approach yields accurate hltering results with smaller ensemble sizes than the standard 
EnKF with localization. 
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